The Named Entity Recognition (NER) subsystem in KAZU identifies and extracts named entities from text. It combines transformer models, LLM-based steps, rule-based systems, and specialized tools for particular entity types. The core flow has three stages: NER steps detect candidate entities; for transformer-based steps, a tokenized word processing stage converts raw token-level predictions into structured entities; and an optional post-processing phase refines the extracted entities. Because each technique is packaged as its own step, different NER approaches can be combined or swapped without changing the rest of the pipeline.
Components
TransformersNERStep
This component wraps Hugging Face's AutoModelForTokenClassification for Named Entity Recognition (NER). It handles tokenization, batching, and prediction, using a sliding window so that documents longer than the model's maximum input length can still be processed. It also supports optional model optimizations such as quantization and compilation with TorchInductor. The raw predictions are passed to a TokenizedWordProcessor, which turns them into structured entities.
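The sliding window idea can be sketched as follows. This is a minimal illustration, not KAZU's implementation: the function names `sliding_windows` and `merge_window_scores`, the `window` and `stride` parameters, and the choice to average scores where windows overlap are all assumptions made for the example.

```python
from typing import List, Tuple


def sliding_windows(n_tokens: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Return [start, end) token ranges that cover n_tokens with overlapping windows."""
    if window <= 0 or not 0 < stride <= window:
        raise ValueError("need window > 0 and 0 < stride <= window")
    if n_tokens <= 0:
        return []
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end >= n_tokens:  # the last window reaches the end of the document
            break
        start += stride
    return spans


def merge_window_scores(
    n_tokens: int,
    spans: List[Tuple[int, int]],
    window_scores: List[List[float]],
) -> List[float]:
    """Average per-token scores over every window that contains the token.

    Averaging is an assumption for this sketch; other merge rules are possible.
    """
    totals = [0.0] * n_tokens
    counts = [0] * n_tokens
    for (start, _end), scores in zip(spans, window_scores):
        for offset, score in enumerate(scores):
            totals[start + offset] += score
            counts[start + offset] += 1
    return [t / c for t, c in zip(totals, counts)]


spans = sliding_windows(n_tokens=10, window=4, stride=2)
print(spans)
```

A stride smaller than the window gives every token near a window edge at least one window in which it has more surrounding context, at the cost of running the model on overlapping input.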
TokenizedWordProcessor
This component is responsible for post-processing the token-level predictions from a transformer model into meaningful entities. It uses SpanFinder (either SimpleSpanFinder or MultilabelSpanFinder depending on the configuration) to identify entity spans from tokenized words and then converts these spans into Entity objects, handling offset calculations and optional suffix stripping.
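To make the span-finding step concrete, here is a small sketch that turns word-level BIO tags into entity spans with character offsets. It assumes a BIO tagging scheme and uses hypothetical names (`Span`, `bio_to_spans`); KAZU's SimpleSpanFinder and MultilabelSpanFinder may work differently, and suffix stripping is not shown.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Span:
    label: str
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    text: str


def bio_to_spans(text: str, words: List[Tuple[int, int]], tags: List[str]) -> List[Span]:
    """Group consecutive B-/I- tagged words into spans.

    words holds (start, end) character offsets per word; tags holds one BIO tag per word.
    """
    spans: List[Span] = []
    current: Optional[Tuple[str, int, int]] = None  # (label, start, end)

    def close() -> None:
        nonlocal current
        if current is not None:
            label, start, end = current
            spans.append(Span(label, start, end, text[start:end]))
            current = None

    for (w_start, w_end), tag in zip(words, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and current is not None and current[0] == label:
            # Continue the open span up to the end of this word.
            current = (label, current[1], w_end)
        elif prefix in ("B", "I"):
            # "B" starts a new span; a stray "I" is treated as a start as well.
            close()
            current = (label, w_start, w_end)
        else:
            close()  # "O" ends any open span
    close()
    return spans


text = "EGFR mutations in lung cancer"
words = [(0, 4), (5, 14), (15, 17), (18, 22), (23, 29)]
tags = ["B-gene", "O", "O", "B-disease", "I-disease"]
for span in bio_to_spans(text, words, tags):
    print(span)
```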
LLMNERStep
This component leverages Large Language Models (LLMs) for Named Entity Recognition. It defines how to interact with an LLM (e.g., Vertex AI) to extract entities based on provided prompts and examples. It handles the parsing of LLM responses into structured entities.
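The response-parsing part can be illustrated with a short sketch. It assumes, purely for the example, that the LLM is prompted to return a JSON object mapping each mention to an entity class; the function name `parse_llm_entities` and that response format are hypothetical and not taken from KAZU.

```python
import json
import re
from typing import List, Tuple


def parse_llm_entities(text: str, raw_response: str) -> List[Tuple[int, int, str]]:
    """Extract a JSON object from an LLM reply and locate each mention in the text.

    Returns sorted (start, end, label) tuples. Malformed replies yield an empty list.
    """
    # LLMs often wrap JSON in extra prose, so take the outermost {...} block.
    match = re.search(r"\{.*\}", raw_response, flags=re.DOTALL)
    if match is None:
        return []
    try:
        mapping = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    if not isinstance(mapping, dict):
        return []

    found = []
    for mention, label in mapping.items():
        if not mention or not isinstance(label, str):
            continue  # skip entries that do not follow the assumed format
        for hit in re.finditer(re.escape(mention), text):
            found.append((hit.start(), hit.end(), label))
    return sorted(found)


text = "Aspirin reduces fever. Aspirin is cheap."
reply = 'Sure, here is the result: {"Aspirin": "drug", "fever": "symptom"}'
print(parse_llm_entities(text, reply))
```

Matching mentions back to the text by exact string search is the simplest option; it silently drops mentions the model paraphrased or invented, which is usually the safer failure mode.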
SpacyNerStep
This component integrates spaCy for Named Entity Recognition. It uses pre-trained spaCy pipelines to identify entities within text. It is typically a lighter-weight option than the transformer step and is often chosen for its speed and efficiency.
OpsinStep
This component utilizes OPSIN (Open Parser for Systematic IUPAC Nomenclature) to identify and parse chemical names into chemical structures. It's a specialized NER step for chemical entities, converting systematic chemical nomenclature into standardized representations.
SethStep
This component wraps SETH (SNP Extraction Tool for Human Variations), which identifies and normalizes mentions of genetic variants, such as SNPs and other mutations, in text. It is a rule-based NER approach tailored to this class of biological entities.
GLiNERStep
This component integrates GLiNER (Generalist and Lightweight Model for Named Entity Recognition) for NER. GLiNER takes the entity labels to search for at inference time, which makes the step flexible across entity types without retraining. The step can potentially use different scoring mechanisms for conflict resolution.
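One simple form of score-based conflict resolution is sketched below: when candidate spans overlap, keep the highest-scoring one. The function name `resolve_overlaps`, the tuple layout, and the greedy strategy are assumptions for illustration; the scoring mechanisms actually used by GLiNERStep are not described here.

```python
from typing import List, Tuple

Candidate = Tuple[int, int, str, float]  # (start, end, label, score)


def resolve_overlaps(candidates: List[Candidate]) -> List[Candidate]:
    """Greedily keep the best-scoring spans so that no two kept spans overlap."""
    kept: List[Candidate] = []
    # Visit candidates from highest to lowest score; ties go to the earlier span.
    for cand in sorted(candidates, key=lambda c: (-c[3], c[0])):
        start, end = cand[0], cand[1]
        if all(end <= k[0] or start >= k[1] for k in kept):
            kept.append(cand)
    return sorted(kept)


candidates = [
    (0, 4, "gene", 0.9),
    (0, 14, "disease", 0.6),   # overlaps the gene span and scores lower
    (18, 29, "disease", 0.8),
]
print(resolve_overlaps(candidates))
```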
EntityPostProcessing
This component provides post-processing functions for refining extracted entities. These include splitting entities on conjunction patterns or numbered-list patterns, which corrects spans in which several distinct entities were merged into one.
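A conjunction-based split can be sketched like this. The regular expression and the function name `split_on_conjunctions` are illustrative assumptions, not KAZU's patterns. A production splitter would also need a guard for names that legitimately contain "and" or a comma.

```python
import re
from typing import List, Tuple

# Separators treated as conjunctions in this sketch: commas, "and", "or".
CONJUNCTION = re.compile(r"\s*(?:,|\band\b|\bor\b)\s*")


def split_on_conjunctions(start: int, text: str, label: str) -> List[Tuple[int, int, str]]:
    """Split one entity into parts at conjunctions, keeping document-level offsets.

    start is the character offset of the entity within the document.
    """
    parts = []
    cursor = 0
    for sep in CONJUNCTION.finditer(text):
        if sep.start() > cursor:  # ignore empty pieces, e.g. from ", and"
            parts.append((start + cursor, start + sep.start(), label))
        cursor = sep.end()
    if cursor < len(text):
        parts.append((start + cursor, start + len(text), label))
    return parts


# An entity that begins at character 10 of some document.
print(split_on_conjunctions(10, "aspirin, ibuprofen and paracetamol", "drug"))
```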
PredictionScriptMain
This component serves as the main entry point for running predictions with a pre-trained Hugging Face token classification model within the KAZU framework. It initializes the TransformersModelForTokenClassificationNerStep (the class behind the TransformersNERStep component described above) and a TokenizedWordProcessor, sets up a processing pipeline, and applies it to input documents.
TrainingProcess
This component encapsulates the logic for training a multi-label Named Entity Recognition (NER) model. It interacts with the TransformersNERStep and TokenizedWordProcessor to prepare data, train the model, and process documents during the training phase.
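What sets a multi-label model apart is that each token can carry several classes at once, so predictions are usually decoded with an independent sigmoid per class instead of a softmax over all classes. The sketch below shows that decoding step only. The function names and the 0.5 threshold are assumptions, and it does not reproduce KAZU's training code.

```python
import math
from typing import List, Set


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def multilabel_decode(
    logits: List[List[float]],
    class_names: List[str],
    threshold: float = 0.5,
) -> List[Set[str]]:
    """Return, for each token, every class whose sigmoid probability meets the threshold."""
    decoded = []
    for row in logits:
        active = {name for name, z in zip(class_names, row) if sigmoid(z) >= threshold}
        decoded.append(active)
    return decoded


# One row of logits per token, one column per class.
logits = [[2.0, -3.0], [1.5, 0.7], [-4.0, -4.0]]
print(multilabel_decode(logits, ["gene", "disease"]))
```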
EvaluationProcess
This component represents the main entry point for evaluating the performance of a Named Entity Recognition (NER) model. Similar to the prediction script, it utilizes the TransformersNERStep and TokenizedWordProcessor to process documents and generate predictions for evaluation.
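The metrics used by the evaluation script are not stated here. As a point of reference, NER models are commonly scored with entity-level precision, recall, and F1 under exact span matching, which can be sketched as follows (the function name `entity_prf` is hypothetical):

```python
from typing import Iterable, Tuple

Entity = Tuple[int, int, str]  # (start, end, label)


def entity_prf(gold: Iterable[Entity], predicted: Iterable[Entity]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over sets of entity tuples."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [(0, 4, "gene"), (18, 29, "disease")]
predicted = [(0, 4, "gene"), (18, 22, "disease"), (30, 35, "drug")]
print(entity_prf(gold, predicted))
```

Under exact matching, the predicted span (18, 22) counts as both a false positive and a missed gold entity, even though it partially overlaps the gold span (18, 29).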